Using Virtual Document for NTCIR-4 Web Information Retrieval Task
Authors
Abstract
The Web is a large collection of heterogeneous pages, and Web documents are not always descriptive or accurate in their content. A significant difference between Web search and traditional text search is the availability of hyperlinks between pages: a page on the Web may cite, or be cited by, other pages, so the neighborhood of a page can serve as part of the input when the page is evaluated. In this paper, in addition to the explicit information unit (page content), we introduce a new information unit into our systems, the virtual document, which is built mainly from the anchor text of inbound links to a page together with the page's title. We analyze the utility of virtual documents for Web searching and propose three search functions based on them:
• We weight query terms by their term entropy in the virtual document collection space.
• Our ranking algorithms index virtual documents as separately queryable data, and our system treats the resulting score as an important indicator of page relevance in addition to the relevance score of the actual Web documents.
• Our system implements a modified version of link analysis that employs literal matching between information units of the NTCIR-4 Web data.
The experimental results show that our Web search system works well with the proposed ranking function. The new information units, virtual documents, play an important role in improving retrieval results. Query term weighting using term entropy in the virtual document space is effective in improving search results. Combining evidence from actual documents and virtual documents improves search results beyond either information source alone. The query-independent score calculated by our proposed link analysis model also yields modest improvements through our tentative re-ranking methods.
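The ideas in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's formulas, so everything below is an assumption made for illustration: the function names, the `(source, target, anchor_text)` link format, the normalized-entropy weight (the global weight familiar from log-entropy term weighting), and the linear interpolation with a hypothetical `alpha` parameter. It shows one plausible way to build virtual documents from titles and inbound anchor text, weight terms by entropy over that collection, and combine a content score with a virtual document score.

```python
import math
from collections import Counter, defaultdict


def build_virtual_documents(titles, links):
    """Build one token list per page from its title and inbound anchor text.

    titles: {url: title}
    links:  iterable of (source_url, target_url, anchor_text) tuples
            (hypothetical input format, not taken from the paper)
    """
    parts = defaultdict(list)
    for url, title in titles.items():
        parts[url].append(title)
    for _source, target, anchor in links:
        parts[target].append(anchor)  # anchor text describes the link target
    return {url: " ".join(texts).lower().split() for url, texts in parts.items()}


def entropy_weights(virtual_docs):
    """Assumed weighting: 1 - (entropy of the term's distribution / log N).

    A term concentrated in few virtual documents gets a weight near 1;
    a term spread evenly over all of them gets a weight near 0.
    """
    n_docs = len(virtual_docs)
    max_entropy = math.log(n_docs) if n_docs > 1 else 1.0
    per_term = defaultdict(dict)
    for url, tokens in virtual_docs.items():
        for term, tf in Counter(tokens).items():
            per_term[term][url] = tf
    weights = {}
    for term, per_doc in per_term.items():
        total = sum(per_doc.values())
        entropy = -sum((tf / total) * math.log(tf / total) for tf in per_doc.values())
        weights[term] = 1.0 - entropy / max_entropy
    return weights


def combine_scores(content_score, virtual_score, alpha=0.5):
    """Assumed linear interpolation of the two evidence sources."""
    return alpha * content_score + (1.0 - alpha) * virtual_score


# Toy data, invented for this example only.
titles = {"a": "Kyoto travel guide", "b": "Kyoto hotels", "c": "Tokyo news"}
links = [("b", "a", "kyoto guide"), ("c", "a", "travel"), ("a", "b", "hotels in kyoto")]
vdocs = build_virtual_documents(titles, links)
weights = entropy_weights(vdocs)
```

In this toy collection, a term that appears in a single virtual document receives a higher weight than one shared by several, which is the intuition behind using entropy to favor discriminative query terms. The value of `alpha` would need to be tuned on held-out topics.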
Related resources
Overview of the Topical Classification Task at NTCIR-4 WEB
This paper gives an overview of the Topical Classification Task that was conducted from 2003 to 2004 as one of the pilot experiments of the WEB Task at the Fourth NTCIR Workshop (‘NTCIR-4 WEB’). In this Topical Classification Task, we attempted to assess the effectiveness of automatic classification systems for retrieved documents from Web search engine systems from a viewpoint of topical rel...
Overview of WEB Task at the Fourth NTCIR Workshop
This paper gives an overview of the WEB Task at the Fourth NTCIR Workshop (‘NTCIR-4 WEB’) conducted from 2003 to 2004. Through the NTCIR-4 WEB, we investigated the evaluation methods used to measure some tasks of Web information access, such as information retrieval, information classification, and information extraction. We used a 100-gigabyte document dataset that was mainly gathered from the...
Experiments on Web Retrieval Driven by Spontaneously Spoken Queries
Motivated to realize the speech-driven information retrieval systems that accept spontaneously spoken queries, we developed a method to collect such speech data derived from the pre-defined search topics that had been systematically constructed for IR research. In order to evaluate both our method and the performance of the document retrieval by using the spontaneously spoken queries, we took p...
Overview of the Informational Retrieval Task at NTCIR-4 WEB
This paper gives an overview of the Informational Retrieval Task that was conducted from 2003 to 2004 as a subtask of the WEB Task at the Fourth NTCIR Workshop (‘NTCIR-4 WEB’). In the Informational Retrieval Task, we attempted to assess the retrieval effectiveness of Web search engine systems from a viewpoint of topical relevance, and to build a re-usable test collection suitable for evaluati...
Overlapping Clustering Method Using Local and Global Importance of Feature Terms at NTCIR-4 WEB Task
In NTCIR-4 WEB Task D (Topical Classification Task), we present an overlapping clustering method for a Japanese meta search engine as an alternative to listing of ranked retrieved results, which most search engines adopt to present the retrieval results. The proposed method clusters the retrieved results dynamically according to two steps: (1) cluster labels consisting of the most important fea...